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Abstract 


We  investigate  tiie  asymptotic  behavior  of  a  general  class  of  on-line  Principal 
Component  Analysis  (PCA)  learning  algorithms,  focussing  our  attention  on  the 
analysis  of  two  algorithms  which  have  recently  been  proposed  and  are  based 
on  strictly  local  learning  rules.  We  rigorously  establish  that  the  behavior  of 
the  algorithms  is  intimately  related  to  an  ordinary  differential  equation  (ODE) 
which  is  obtained  by  suitably  averaging  over  the  training  patterns,  and  study  the 
equilibria  of  these  ODEs  and  their  local  stability  properties.  Our  results  imply 
in  particular  that  local  PCA  algorithms  should  always  incorporate  hierarchical 
rather  than  more  competitive,  symmetric  decorrelation,  for  reasons  of  superior 
performance  of  the  algorithms. 


1      Introduction 

The  ability  to  extract  the  main  features  inherent  in  complex,  high-dimensional 
input  data  streams  is  of  fundamental  importance  to  many  information  process- 
ing systems.  Such  "dimensionality  reduction"  occurs  e.g.  as  a  preprocessing 
stage  for  efficient  pattern  recognition  and  classification,  helps  eliminating  dis- 
turbing noise  or  information  redundancy,  and  is  necessary  to  allow  for  further 
transmission  of  the  relevant  information  the  input  signal  contains  if  not  enough 
transmission  channels  are  available. 

Generally  speaking,  optimal  feature  extraction  can  be  described  as  con- 
structing a  function  F  which  compresses  a  (f-dimensional  input  vector  x  into 
a  p-dimensional  output  vector  y  =  F{x),  where  p  <  d  and  usually  p  <^  d,  such 
that,  relative  to  some  performance  criterion,  y  contains  as  much  information 
about  X  as  possible.  If  the  mean  squared  error  of  the  best  linear  estimate  of  x 
given  y  (the  "linear  reconstruction  error")  is  used  as  a  criterion,  this  leads  to  a 
statistical  method  called  Principal  Component  Analysis  (PCA);  see  Bourlard  & 
Kamp  (1988),  Linsker  (1988),  Sanger  (1989),  Baldi  k  Hornik  (1991).  PCA 
is  one  of  the  simplest  and  most  general  purpose  feature  extraction  techniques 
which  extracts  information  by  finding  the  directions  in  which  the  inputs  exhibit 
most  significant  variation.  The  PCA  outputs  are  given  as  y  =  Ax,  where  .4  is 
a.  p  X  d  matrix  such  that  the  rows  of  A  span  the  same  subspace  of  IR'^  as  the 
eigenvectors  associated  with  the  p  largest  eigenvalues  of  the  input  covariance 
matrix,  see  e.g.  Baldi  &i  Hornik  (1991). 

A  variety  of  PCA  learning  algorithms  have  been  proposed  within  the  last 
decade.  If  upon  presentation  of  a  new  learning  pattern  i,  we  modify /I  according 
to 

AA  =  y{yx'  -yy'A),  y  =  Ax,  (1) 

(in  what  follows,  y  is  the  learning  rate  and  '  denotes  transpose),  we  obtain  an 
algorithm  introduced  independently  by  Williams  (1985)  as  the  Symmetric  Er- 
ror Correction  Algorithm  (SEC),  by  Baldi  (1988)  as  a  symmetric  simplification 
of  the  Back  Propagation  algorithm  for  a  linear  d-p-d  architecture  in  autoeisso- 
ciative  mode,  and  by  Oja  (1989)  as  the  subspace  algorithm.  The  Generalized 
Hebbian  Algorithm  (GHA)  introduced  by  Sanger  (1989)  updates  .4  by 

A^  =  7(yx'  -  lower(yy'  )A),  (2) 

where  the  "lower"  operator  sets  all  entries  above  the  main  diagonal  to  zero. 

Clearly,  in  both  algorithms,  the  first  additive  term  is  just  hebbian  learning 
performing  gradient  descent  on  an  energy  function  which  maximizes  the  sum 


1 


of  the  output  unit  variances,  whereas  the  second  term  tends  to  keep  the  weight 
matrix  .4  constrained  suitably;  for  the  symmetric  algorithm,  it  keeps  AA'  close 
to  the  p-dimensional  identity  matrix  (Baldi  &  Hornik,  1991),  and  in  Sanger's 
algorithm,  it  performs  Gram-Schmidt  orthonormalization  on  the  rows  of  .4, 
thereby  hierarchically  decorrelating  the  outputs.  Both  algorithms  are,  although 
only  implicitly,  already  contained  in  Oja  &;  Karhunen  (1985).  For  the  one-unit 
case  (p  =  1),  they  both  reduce  to  the  algorithm  first  introduced  by  Oja  (1982) 
as  a  small  learning  rate  first  order  approximation  to  hebbian  learning  with 
additional  euclidean  normalization  of  the  weight  vector;  in  the  sequel,  we  shall 
always  refer  to  this  algorithm  as  "Oja's  one-unit  algorithm". 

Recently,  Foldiak  (1989)  and  Rubner  k,  Tavian  (1990)  have  introduced  two 
new  algorithms  which  are  based  on  a  combination  of  Oja's  one-unit  algorithm 
applied  to  each  of  the  rows  of  A  (or,  roughly  equivalent  to  that,  hebbian  learn- 
ing with  rowwise  euclidean  normalization)  and  some  lateral  inhibition  mecha- 
nism designed  for  decorrelating  the  outputs.  Hence,  the  network  architectures 
employed  for  these  algorithms  utilize  an  additional  set  of  connection  weights 
accounting  for  decorrelation.  Learning  of  A  is  strictly  local  in  the  sense  that 
the  modification  of  the  i-th  row  of  .4  only  depends  on  the  input  x  and  the  i-ih 
network  output;  for  tlie  remaining  weights,  anti-hebbian  learning  is  used,  which 
again  is  a  very  simple,  local  rule.  Due  to  the  locality  of  these  learning  mecha- 
nisms, it  has  been  argued  that  these  algorithms  be  "biologically  more  plausible" 
than  the  subspace  algorithm  or  GHA. 

Our  concern  in  the  present  paper  is  a  rigorous  analysis  of  the  properties  of 
these  local  PCA  feature  extraction  algorithms  for  the  case  where  a  large  number 
of  training  samples  is  available.  Such  an  analysis  is  usually  based  on  the  claim 
that,  assuming  that  the  weight  changes  are  sufficiently  small,  the  sequence  of 
weights  generated  by  the  learning  algorithm  can  be  approximated  by  the  so- 
lution paths  of  an  ordinary  differential  equation  (ODE)  which  is  obtained  by 
"averaging  over  all  patterns",  and  that  the  weights  converge  to  the  asymptoti- 
cally stable  equilibria  of  this  ODE.  We  shall  provide  a  precise  result  supporting 
this  claim  for  the  case  where  the  training  patterns  are  independent  centered 
random  variables  with  the  same  covariance  matrix  and  the  learning  rates  tend 
to  zero  at  a  suitable  rate.  In  particular,  we  describe  the  appropriate  averag- 
ing procedure  for  local  algorithms  based  on  a  feedback  architecture,  such  as 
Foldiak's.  In  addition  to  that,  we  present  a  stability  analysis  of  the  equilibria  of 
the  ODEs  associated  with  the  algorithms.  In  particular,  we  shall  establish  that 
for  symmetric  lateral  inhibition  mechanisms  between  the  outputs,  as  the  archi- 
tecture originally  introduced  in  Foldiak  (1989),  the  desired  limit  points  are  not 


asymptotically  stable  equilibria  of  the  associated  ODE.  Therefore,  hierarchical 
decorrelation,  although  clearly  disallowing  for  "competition"  and  lacking  sym- 
metry, should  be  favored  over  symmetric  decorrelation,  for  reasons  of  superior 
performance. 

This  paper  is  organized  as  follows.  Section  2  introduces  a  general  class  of 
local  PCA  feature  extraction  algorithms  which  contains  the  ones  introduced 
in  Foldiak  (1989)  and  Rubner  &;  Tavian  (1990).  Section  3  describes  a  precise 
method  of  associating  an  ODE  to  on-line  learning  algorithms.  Results  are  given 
in  section  4.  Section  5  contains  some  additional  remarks.  All  proofs  are  deferred 
to  the  appendix. 

2      Local  PCA  Algorithms 

One  class  of  local  PCA  feature  extraction  algorithms,  including  the  algorithm 
introduced  by  Rubner  &i  Tavian  (1990),  can  be  described  as  follows,  cf.  Baldi  L 
Hornik  (1991),  Kuan  &;  Hornik  (1990).  Using  an  additional  linear  layer  for 
decorrelation,  the  network  output  is  computed  as 

y  =  QAx, 

where  Q  is  a,  p  x  p  matrix  which  performs  the  decorrelation. 

Upon  presentation  of  a  new  learning  pattern,  A  is  updated  using  either 
hebbian  learning  with  rowwise  normalization,  or,  basically  equivalent  thereto  if 
the  learning  rates  are  small,  using  Oja's  one-unit  algorithm  applied  to  each  of 
the  rows  of  A,  which  can  compactly  be  written  as 

AA  =  j{yx'  -dmgiyy')A).  (3) 

Here,  the  "diag"  operator  sets  all  off-diagonal  entries  to  zero.  In  the  sequel,  we 
shall  also  use  the  "subdiag"  operator,  which  sets  all  entries  on  or  above  the  main 
diagonal  to  zero,  and  the  "offdiag"  operator,  which  sets  all  diagonal  entries  to 
zero. 

Using  the  decorrelation  mechanism  proposed  by  Barlow  <k  Foldiak  (1989), 
Q  is  written  as  Q  =  I  +  W  where  /  is  the  identity  matrix  and  [V  is  symmetric 
with  zero  diagonal  and  updated  using  the  simple,  anti-hebbian  learning  rule  (as 
in  the  novelty  filter  of  Kohonen  (1984)) 

AM^  = -;iofrdiag(yy').  (4) 

Alternatively,  hierarchical  (i.e.  in  some  sense  Gram-Schmidt  type)  decorrelation 
can  be  employed,  which  is  accomplished  upon  writing  Q  =  I  -\-  W ,  where  now 


W  is  subdiagonal  (i.e.,  all  entries  of  W  which  are  on  or  above  the  main  diagonal 
are  zero),  and  updating  W  according  to 

AW  =  -^subdiag(yy').  (5) 

The  architecture  introduced  by  Foldiak  (1989)  is  depicted  below;  white  cir- 
cles indicate  hebbian,  black  circles  anti-hebbian  connections. 


The  network  architecture  proposed  by  Foldiak 

Here,  the  network  outputs  are  the  sum  of  weighted  network  inputs  and  the 
weighted  feedback  received  from  the  other  output  units,  such  that,  upon  pre- 
sentation of  an  input  i,  the  outputs  are  updated  according  to 

Ax  +  Wy^y,  (6) 

where  W  is  the  p  x  p  matrix  of  lateral  connection  strengths. 

Initially,  W  =  O  and  A  is  "random".  Foldiak  suggests  updating  A  and  W 
according  to  rules  (3)  and  (4),  respectively,  such  that  W  is  kept  symmetric  with 
zero  diagonal  throughout  the  learning  process,  and  claims  (p.  402)  that  when 
an  input  is  presented  to  the  network,  the  units  settle  to  a  stable  state  for  which 

y  =  Ax  -f  Wy, 

or 

y  =  {I  -  W)-^Ax. 

However,  this  is  not  necessarily  true  (see  also  Baldi  k,  Hornik,  1991).  Let  us 
write  y{k)  for  the  network  output  after  k  updating  cycles  using  (6),  with  fixed 
input  X  and  initial  output  y(0).  Clearly, 

y{k)     =     Wy{k-\)^Ax 


=     W''y{0)  +  {I  +  W  +  ---+W''-^)Ax. 


To  ensure  convergence  of  !/(^-)  to  (/—  VV)~^Ax  as  A:  — '  oo,  we  thus  need  that  all 
eigenvalues  of  W  are  less  than  one  in  absolute  value,  which  is  not  guaranteed 
by  the  algorithm.  Even  if  the  algorithm  is  modified  accordingly,  it  would  still 
require  infinitely  many  cycles  to  converge  to  the  stable  state,  which  is  of  course 
computationally  infeeisible  for  real-time  applications.  (Of  course,  the  linear 
system  (/  —  W)y  =  Ax  could  be  solved  explicitly  in  finite  time;  but  then  the 
architecture  is  no  longer  self-contained,  and  the  particularly  attractive  feature 
of  performing  only  simple  local  computations  is  lost.) 

Both  problems  can  be  overcome  by  using  asymmetric  (=hierarchical)  decor- 
relation,  i.e.  using  learning  rule  (5)  rather  than  (4),  which  together  with  the 
initialization  W  =  O  keeps  W  strictly  subdiagonal  throughout  the  learning  pro- 
cess. In  fact,  in  this  case,  A  =  0  is  the  only  eigenvalue  of  W ,  and  it  is  easily 
seen  that  the  {i,j)-th  entry  of  W''  vanishes  ii  i  <  j  +  k  such  that  in  particular, 
\V^  —  O  for  aW  k  >  p  and 


(/-  ^y)-l  ^  I+W  +  ---+WP 


-1 


Hence,  after  p  updating  cycles,  the  stable  state  is  reached,  irrespective  of  the 
initial  network  output  y(0).  Interestingly  enough,  it  will  be  shown  in  section  4 
that  if  asymmetric  decorrelation  is  employed,  then,  during  the  learning  period, 
it  is  enough  to  perform  at  least  2  cycles  before  updating  the  weight  matrices, 
thereby  making  the  algorithm  "quicker". 

All  algorithms  introduced  thus  far  are  of  the  following  general  form.  Upon 
presentation  of  a  new  learning  pattern  x,  the  network  output  y  is  updated 
according  to 

PiW)y  +  Q{W)Ax^y,  (7) 

where  P{W)  and  Q{W)  are  polynomials  in  the  p  x  p  matrix  W  which  satisfy 
P(0)  =  O  and  Q{0)  =  I ,  and  then  A  and  W  are  updated  as 

AA     =     j(yx' -^{yy')A),  (8) 

AW     =     /ir>(yy'),  (9) 

where  4>  and  Q  are  suitable  linear  (selection)  operators  on  the  space  of  all  p  x  p 
matrices;  as  initializations,  we  take  W  =  O,  y  =  0,  and  A  as  "random".  In 
fact,  taking  Q  =  O  and  <I>  as  the  identity  mapping,  we  obtain  the  subspace 
algorithm  (1);  the  choice  Q  =  O  and  $  =  lower  gives  Sanger's  GHA  (2).  If 
both  P  and  Q  are  nonzero,  the  network  outputs  receive  feedback  from  previous 
outputs. 


All  local  algorithms  use  <5  =  diag.  For  the  algorithm  of  Rubner  &:  Tavian, 
P  =  O,  Q{IV)  =  I  +  W,  ^  =  diag  and  Q  =  -subdiag.  Real-time  implementa- 
tions based  on  Foldiak's  architecture,  which  perform  only  a  finite  number,  say  k, 
of  output  updating  cycles  before  updating  the  weight  matrices,  use  P(W)  =  W^, 
Q{W)  ^  I  +  [V  +  ■■■  +  W''-\  ^  =  diag,  and  Q  =  -subdiag  or  -ofTdiag. 

Of  course,  the  above  class  of  algorithms  could  be  enlarged  by  allowing  for 
more  general  functions  P  and  Q;  for  example,  using  Foldiak's  original  idea  to 
let  the  outputs  settle  into  the  stable  state  before  updating  the  network  weights 
corresponds  to  the  choice  P  =  O  and  Q(W)  =  (I  —  \V)~^  (formally,  the  feedback 
is  eliminated  by  stabilization).  However,  as  already  pointed  out,  this  choice  is 
computationally  infeasible  if  combined  with  (4),  and  contained  in  the  polynomial 
setting  if  W  is  kept  subdiagonal  using  (5).  Therefore,  generality  is  not  really 
restricted  by  considering  only  the  case  where  P  and  Q  are  matrix  polynomials 
in  W. 


3      The  ODE  Method 

On-line  network  learning  algorithms  are  of  the  general  form 

en  =  n(en-i-¥7nhi:n,en-i)).  (10) 

where  9  is  the  vector  of  network  parameters  to  be  learned  and  6^  is  its  estimate 
after  n  updating  steps,  ;„  is  the  training  pattern  and  jn  the  learning  rate  used 
at  the  n-th  learning  step,  /?(•,)  is  a  function  characteristic  of  the  algorithm, 
and  n  a  "projection"  mapping  which  may  be  necessary  to  keep  the  parameter 
updates  constrained  suitably. 

The  key  tool  in  the  analysis  of  the  sequence  {6^}  is  the  so-called  interpolated 
process  e°{-)  =  (^°(0,«  >  0),  defined  by 

go^t)  =  l:LZlg^_^  +  LL^:izl0^^        t^_,<t<t,,  (11) 

In  In 

where 

to  =  0,  tn   =  Jl  + h  7r>; 

i.e.,  6°{)  is  obtained  by  piecewise  constant  interpolation  of  {9n}  with  interpo- 
lation intervals  {jn}-  Observe  in  particular  that  d°{tn)  =  9n- 

Kuan  &;  Hornik  (1990)  investigate  the  properties  of  the  interpolated  process 
for  the  case  of  small  constant  learning  rates  and  give  applications  to  Error 
Back-Propagation  in  supervised  learning  and  PCA  feature  extraction  algorithms 


based  on  feedforward  architectures.  In  the  present  paper,  we  shall  always  assume 
that  7n  ^  0  at  a  suitable  rate  as  n  —<■  oo.  In  this  case,  it  can  be  shown  tlial 
^°(-)  eventually  follows  the  solution  paths  of  an  ODE  (Ljung,  1977;  Kushner  L 
Clark,  1978;  Ljung  k.  Soderstrom,  1983).  More  precisely,  Kushner's  method, 
which  we  find  most  convenient  to  use  for  our  purpose, 'proceeds  as  follows. 

Let  us  write  m  for  the  number  of  components  of  9,  and  introduce  left  shifts 
0"(.)  -  (6»"(/),/  >  0)  of  the  interpolated  process  (11)  by  means  of  $"{1)  = 
9°{tn  +  t)\  observe  that  ^"(0)  =  ^°(<„)  =  ^„.  Clearly,  all  processes  ^"(•)  are 
elements  of  C([0,  oo),  IR"^),  the  space  of  all  iR'"-valued  continuous  functions  on 
[0,oo).  Under  suitable  conditions,  it  can  be  shown  that  the  set  of  processes 
{^"(•)}  is  bounded  and  equicontinuous  on  [0,T]  for  all  T  <  oo,  such  that  by 
the  famous  Arzela-Ascoli  Theorem  (see  e.g.  Dunford  &  Schwartz,  1966),  it  is  a 
relatively  compact  subset  of  C([0,  oo),  IR'")  if  this  space  is  given  the  topology  of 
uniform  convergence  on  bounded  intervals.  (I.e.,  for  each  infinite  subsequence 
{m;}  we  can  find  a  subsequence  {nj  }  C  {n/}  such  that  ^"'  (•)  converges  uniformly 
on  bounded  intervals.)  If  for  the  moment  we  assume  that  11  is  just  the  identity 
mapping,  then  the  limits  of  convergent  subsequences  satisfy  the  ODE  6  =  h{0), 
where  the  dot  denotes  the  derivative  with  respect  to  t  and  h  is  obtained  from 
/i  by  a  suitable  averaging  procedure.  More  precisely,  suppose  that,  as  is  the 
case  for  the  feature  extraction  algorithms  we  are  interested  in,  the  learning 
patterns  Zn  can  be  decomposed  as  z^  =  (x„,y„),  where  the  {xn}  are,  say,  a 
sequence  of  independent  random  variables,  and  the  ?/„  are  generated  throughout 
the  algorithm  according  to 

yn   -  g{Xn,yn-l,On-l)  (12) 

with  some  initial  j/o,  e.g.  yo  =  0.  Hence,  in  general,  y„  (implicitly)  depends 
on  all  previous  parameter  updates  ^o, .  .  . ,  9n-i  and  exogeneous  network  inputs 
xi, . . . .  Xn-  For  fixed  9,  define  a  sequence  yn{d)  by  means  of  the  recursion 

yn{9)  =  g{xn,yn-i{0),e)  (13) 

with  initial  condition  yo(^)  =  J/o-  Then,  provided  that  the  limit  exists  and 
suitable  additional  assumptions  are  satisfied,  we  may  take 

h(9)  =    lim  Eh(xr,,y„(9),9).  (14) 

n  — CO 

Now  let  0*  be  the  set  of  all  asymptotically  stable  equilibria  of  the  ODE 
9  =  h{9)  and  V{Q')  its  domain  of  attraction.  If  one  can  show  that  {9n}  enters 
some  compact  subset  of 'D(0*)  infinitely  often,  then  the  above  approach  allows 


to  conclude  that  9n  converges  to  some  6'  ^  Q'  as  n  ^  cc.  However,  the 
verification  of  this  condition,  leading  to  a  global  asymptotic  analysis  of  the 
solutions  of  a  usually  complicated,  nonlinear  system  of  differential  equations, 
unfortunately  appears  to  be  virtually  impossible  for  many  applications  and  in 
particular,  for  most  feature  extraction  algorithms  we  are  concerned  with,  the 
only  exception  that  we  are  currently  aware  of  being  Oja's  one-unit  algorithm 
which  was  fully  analyzed  in  Oja  h  Karhunen  (1985).  In  any  case,  the  above 
characterization  of  the  asymptotic  paths  of  the  interpolated  process  implies 
that  for  a  "good"  algorithm,  the  asymptotically  stable  equilibria,  being  at  least 
locally  attractive  limit  points  of  the  associated  ODE,  should  be  desired  limit 
points  of  the  algorithm  (i.e.  points  one  actually  wants  the  algorithm  to  converge 
to).  On  the  other  hand,  if  none  of  the  desired  limit  points  is  an  eisymptotically 
stable  equilibrium  of  the  ODE,  then  we  expect  the  performance  of  the  algorithm 
to  be  rather  poor.  Therefore,  an  explicit  characterization  of  0*  is  of  fundamental 
importance  in  understanding  the  asymptotic  properties  of  the  algorithm,  even 
if  one  does  not  succeed  in  identifying  ■D(0' ). 

For  applicability  of  Kushner's  ODE  method  as  outlined  above,  as  well  as 
for  reasonable  behavior  of  the  algorithm,  it  is  necessary  that  the  sequence  {On} 
of  parameter  updates  remains  bounded  or  constrained  to  a  suitable  compact 
subset  of  IR."* .  For  example,  we  already  know  that  if  we  use  Foldiak's  architec- 
ture with  symmetric  decorreiation,  then  the  eigenvalues  of  Wn  should  eventually 
be  less  than  one  in  absolute  value.  This  goal  can  be  accomplished  by  imple- 
menting some  projection  device  IT  as  in  equation  (10).  Such  a  device  could 
e.g.  truncate  the  entries  of  9  if  they  become  too  large;  for  feature  extraction 
networks,  this  would  enforce  biologically  very  plausible  limits  to  maximal  in- 
terconnection strengths.  Or,  11  could  project  into  a  lower  dimensional  compact 
subset  of  IR"^ ,  as  is  the  case  if  the  A  part  is  trained  using  hebbian  learning  with 
rowwise  euclidean  normalization.  As  a  rule  of  thumb,  the  ODE  then  becomes 
9  G  DIl{h{9)),  where  DU  is  the  set  of  all  directional  derivatives  of  11;  for  more 
details,  see  e.g.  chapter  5.3  in  Kushner  &  Clark  (1978).  In  particular,  if  trunca- 
tion is  employed  to  constrain  the  updates  to  some  hyperrectangle,  then  we  still 
have  9  —  h{9)  in  the  interior  of  the  hyperrectangle. 

4      Results 

Now  let  us  apply  Kushner's  ODE  method  to  the  general  class  of  PCA  feature  ex- 
traction algorithms  introduced  in  section  2.  In  this  case,  9  —  [vec(/l)',  \ec{W)'\' , 
where  the  "vec"  operator  stacks  one  column  above  the  other.  Observe  that  ba- 


8 


sically  all  algorithms  of  interest  keep  W  constrained  to  a  lower  dimensional 
subspace  VV  of  the  space  of  all  p  x  p  matrices.  Hence,  in  a  nonredundant 
parametrization  9  <-»•  {A,  W),  6  contains  the  entries  of  A  and  the  coordinates 
of  W  with  respect  to  a  suitable  basis  of  W.  In  the  sequel,  we  shall  find  it  no- 
tationally  convenient  to  continue  the  analysis  in  terms  of  A  and  W ,  keeping  in 
mind  that  W  £  VV. 

For  sake  of  simplicity,  let  the  learning  rates  be  the  same  for  both  the  A  and 
the  W  part  of  the  algorithm.  The  updating  equations  are 

A/lr,       =       yniynX'n-'^{yny'n)An-l),  (15) 

AWr,        =       InCliyuV'n),  (16) 

where  the  sequence  {j/n}  is  generated  by  yo  =  0  and 

yn  ^  PiWn-l)yn-l  +  Q{Wn-l)An-iXn,  (17) 

such  that 

ynie)     =     P{W)yn_i{e)  +  Q{W]Axr, 


ri-l 


=       ^P(WyQ{W)AXn.^. 

i  =  0 

For  what  follows,  we  assume  that 

[A  1]   {xn}   is  a  sequence  of  independent,   bounded  random  vectors  with  mean 
zero  and  the  same  covariance  matrix  E. 

[A  2]   {'in}  IS  a  sequence  of  positive  numbers  satisfying 

^7n=oo,  ^7;<oo. 

n  n 

Using  [A  1],  we  have 

Eyn{9)x'n  ^Q{W)AE 

independently  of  n,  and  the  covariance  matrix  of  yn{d)  is  given  by 

n-l 

Eyn(e)yn{9y  =  ^  P(WyQ{W)AEA'Q(Wy p{wy\ 

1  =  0 


Hence,  in  order  to  guarantee  that  the  limiting  covariance  matrix  of  yni9)  exists, 
we  need  all  eigenvalues  of  P{W)  to  be  less  than  one  in  absolute  value,  in  which 
case 

oo 

lim  Eyn{9)yn(0y  ^Y  P{WyQiW)AEA'Q{WYP{W)"  =:  R(A,W) 

n— 'OO  ^— ' 

1  =  0 

and,  by  combining  (14)  with  (15)  and  (16)  and  the  above  computations,  we 
obtain  the  asymptotic  ODE 

A     =     QiW)AE  -  ^(R)  A  (18) 

W    =    QiR)  (19) 

with  R  =  R(A,  W). 

As  P{0)  =  O  by  assumption,  the  above  eigenvalue  condition  is  automatically 
satisfied  if  Q  sets  at  least  all  on-  and  superdiagonal  entries  to  zero  such  that 
W  remains  "at  most"  subdiagonal  throughout  the  algorithm.  In  this  case,  we 
necessarily  have  P{Wy  =  O  for  z  >  p  such  that  R(A,  W)  is  the  sum  of  only 
finitely  many  terms.  The  following  key  result  will  be  proved  in  the  appendix. 

Theorem  1.  Let  {An]  and  {Wn]  be  generated  by  (15)  and  (16),  respectively, 
and  lei  A"(-)  and  W^i)  be  the  left  shifts  of  the  corresponding  interpolated  pro- 
cesses. 

Assume  that  W  contains  only  subdiagonal  matrices  and  that  with  probability  one, 
the  sequence  {An,Wn}  is  bounded.  Then,  with  probability  one,  iA'^{-),W"{-)} 
IS  bounded  and  equicontinuous  on  bounded  intervals.  If  (A(),  W{-))  is  the  limit 
of  a  convergent  subsequence,  it  satisfies  the  ODE 

A     =     Q(W)Ai:-^R)A 
W     =     Q{R). 

with  R  =  RiA,  W)  and  W  eW. 

Let  0*  be  the  set  of  all  locally  asymptotically  stable  equilibria  {A,  W)  of  the 
above  ODE  and  'D(Q')  its  domain  of  attraction.  If  (An,Wn)  enters  a  compact 
subset  S  ofD{Q')  infinitely  often  with  probability  one,  then  {An,Wn)  -^  Q* 
with  probability  one. 

As  already  explained  in  section  3,  boundedness  of  {An,  Wn]  can  always  be 
ensured  upon  combination  of  the  algorithm  with  a  suitable  projection  mech- 
anism which  confines  the  updates  to  a  bounded  subset  of  the  network  weight 
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space.  If  the  updates  are  constrained  to  some  hypercube  by  a  simple  trunca- 
tion device,  the  asymptotic  paths  satisfy  the  above  ODE  in  the  interior  of  that 
hypercube. 

In  the  case  where  W  is  not  constrained  to  subdiagonality,  it  has  to  be  guar- 
anteed that  \Vn  eventually  remains  in  the  stability  region  which  consists  of  all 
W  G  H'  for  which  all  eigenvalues  of  P(W)  are  less  than  one  in  absolute  value 
(i.e.,  the  region  where  R{A,  W)  is  actually  defined).  Again,  this  can  be  ensured 
using  simple  truncation  mechanisms.  As  an  example,  if  P{W)  =  W''  for  some 
^-  >  1  as  is  the  case  in  real-time  applications  based  on  Foldiak's  architecture, 
then  it  is  easily  seen  that  it  suffices  to  keep  \ujij\  <  p~^  for  all  off-diagonal  entries 
uiij  of  W.  To  avoid  unnecessary  technicalities,  we  shall  not  formulate  an  explicit 
theorem  and  continue  to  refer  to  (18)  and  (19)  as  "the  asymptotic  ODE",  al- 
though in  general  it  will  only  be  defined  for  H^  in  a  suitable  neighborhood  of  (9; 
for  more  details,  see  Kushner  k  Clark  (1978,  p.  40  and  section  5.3). 

We  now  turn  over  to  the  investigation  of  the  equilibria  of  the  asymptotic 
ODE  and  their  local  stability  properties.  For  this  purpose,  we  shall  for  sake  of 
simplicity  make  the  additional  assumption  that 

[A3]   .4//  eigenvalues  of  D  are  distinct  and  positive. 

Let  us  write  A,  for  the  j'-th  largest  eigenvalue  of  S  and  u;  for  an  associated  unit 
length  eigenvector  which  is  then  unique  up  to  a  change  of  sign.  In  what  follows, 
it  will  also  be  convenient  to  let  (t  resp.  r  denote  the  coefficient  of  W  in  P{W) 
resp.  Q{W)  such  that 

P{W)  =  aW  +  ---,         QiW)  =  I  +  rW  +  ■■■ 

where  the  dots  indicate  terms  containing  higher  powers  of  W . 

The  equations  for  an  equilibrium  point  of  the  ODE  (18)  and  (19)  are 

QiW)Ai:^^(R)A,         n(R)  =  0,         R=R(A,W).  (20) 

Of  course,  an  explicit  solution  of  these  equations  is  impossible  without  being 
more  specific  about  the  particular  choice  of  $  and  Q.  Assume  in  addition  that 
QiW)  has  full  rank  p;  in  fact,  this  is  the  case  for  all  relevant  W  in  all  applications 
of  our  interest.  We  then  have  the  following  general  result. 

Theorem  2.  Let  A  and  W  solve  (20)  such  that  Q{W)  has  full  rank  p.  Let 
r  :=  rank(,4).  Then  there  exist  a  p  x  r  matrix  C  of  full  rank  r  and  a  sequence 
of  mutually  distinct  indices  I  <  i^,  .  .  .  ,ir  <  d  such  that 

A  =  C[ui„...,u,^]'.  (21) 
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For  the  proof  of  theorem  2,  see  the  appendix.  Geometrically  speaking,  equa- 
tion (21)  says  that  span(.4'),  the  subspace  of  IR  spanned  by  the  rows  of  A, 
equals  span{ii,i ,  .  .  . ,  u,^}. 

Of  course,  as  we  are  interested  in  algorithms  which  extract  the  first  p  prin- 
cipal components,  we  expect  any  "reasonably  good"  algorithm  to  exhibit  the 
following  behavior.  The  set  of  asymptotically  stable  equilibria  should  not  be 
empty,  and  all  asymptotically  stable  equilibria  should  have  rank(.4)  =  p  and 
{ii,  ■  ■  ■ ,  ip]  =  {l,...,p};  all  other  equilibria,  corresponding  to  ^'s  which  ex- 
tract some  "wrong"  (or  not  enough)  principal  components,  should  be  unstable. 
In  addition  to  that,  W  should  at  least  be  subdiagonal  to  allow  the  mature  net- 
work for  finite-time  exact  computation  of  its  output  to  a  previously  unseen  input 
pattern.  Hence  in  particular,  if  W  is  kept  symmetric  throughout  the  algorithm, 
all  asymptotically  stable  equilibria  should  have  W  =  O. 

As  will  be  shown  in  the  appendix,  equilibria  with  rank(i?(.4,  W))  <  p  are 
always  unstable;  hence,  in  the  following  discussion,  we  may  restrict  our  attention 
to  equilibria  with  full  rank  R. 

For  the  local  PCA  algorithms  which  are  the  main  concern  of  this  paper, 
we  have  $  =  diag  and  fi  =  — offdiag  or  Q  =  — subdiag;  in  the  sequel,  we 
shall  refer  to  these  two  variants  as  a  local  PCA  algorithm  "in  symmetric  mode" 
respectively  "in  asymmetric  mode".  In  either  case,  due  to  the  fact  that  R{A,  W) 
is  symmetric,  the  equations  for  an  equilibrium  are 

Q{W)AE  =  dmg{R)A,         ofrdiag(;?)  =  0,         R  =  R(A,  \V).        (22) 

We  have  the  following  result. 

Theorem  3.  .4//  solutions  of  (22)  with  W  =  O  and  full  rank  R  are  such  that 
the  rows  of  A  are  mutually  perpendicular  unit  length  eigenvectors  of  H,  with 
associated  (diagonal)  eigenvalue  matrix  R  =  AT,A' .  //r  ^  0.  these  are  the  only 
equilibrium  points  with  subdiagonal  W  and  full  rank  R. 

Remark.  If  r  =  0,  there  may  be  critical  points  with  subdiagonal,  but  nonzero  W 
and  full  rank  R,  see  example  1  of  the  appendix.  In  the  symmetric  cases,  a  com- 
plete description  of  the  set  of  equilibrium  points  (even  with  full  rank  R)  is  very 
hard.  Example  2  of  the  appendix  shows  that  each  of  the  "desired"  equilibria 
may  actually  lie  on  a  curve  of  equilibria. 

The  local  stability  properties  of  the  local  PCA  algorithms  are  given  in  the 
following  two  theorems.  To  simplify  matters,  let  J  =  {{i,j)  :  1  <  j  <  ^  <  p}, 

S  =  {{a,T)  :  (7  +  r  >  1,  (T  <  0}, 
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and  let 

a,_,((T,  t)  =  Xt  -\-  (cT  +  T  -  l)\j,  0i]{(T,  t)  -  {(7  -\-  r)A,  -  aXj. 

Theorem  4.  For  the  local  PCA  algorithms  in  asymmetric  mode,  equilibria  with 
\V  —  O  are  asymptotically  stable  if  and  only  if 

A  =  [±ui, . .  .,±Uf,]' 

and 

aij{<r,  r)  >  0,  /^,j(cr,  r)  >  0  for  all  {ij)  G  J. 

This  condition  is  satisfied  for  arbitrary  \i  >  ■  ■  ■  >  \p  >  0  if  and  only  if  (a,  r)  G 
S.  If  for  some  i,  the  i-th  row  of  A  does  not  equal  ±u, ,  or  if  aij{a,T)  <  0  or 
0ij{a,  r)  <  0  for  some  {i,j)  €  J ,  the  equilibrium  is  unstable.  This  is  always  the 
case  if  cr  +  T  <  0. 

Corollary.  If{a,T)  G  S,  the  asymptotically  stable  equilibria  of  the  local  PCA 
algorithms  in  asymmetric  mode  are  exactly 

A  =  [±ui ±up]',  W  =  0, 

and  all  other  equilibria  are  unstable. 

For  the  algorithm  of  Rubner  k  Tavian  (1990),  PiW)  =  O  and  Q{W)  ^  I  +  W, 
hence  {a,  r)  —  (0, 1)  G  5  and  the  conclusions  of  the  corollary  apply.  Algorithms 
based  on  the  hierarchical  modification  of  Foldiak's  architecture  which  perform 
k  y-updating  cycles  use  PiW)  =  W'  and  Q{W)  =  I  +  W  +  ■■■+  W''-K  Thus, 
if  fc  =  1,  we  have  a  —  \  and  r  =  0  and  all  equilibria  are  unstable  because  for 

i  >  h 

/?„(1,0)  =  A,-A,  <0. 

\{  k  >  1,  (a,  t)  =  (0,  1)  G  S,  and  again,  the  conclusions  of  the  corollary  apply. 
We  infer  that  during  the  learning  period,  it  is  not  necessary  to  perform  exact 
decorrelation  before  updating  the  network  weights,  thereby  motivating  the  use 
of  "quicker"  feedback  algorithms  with  1  <  ifc  <  p. 

Theorem  5.  For  a  local  PCA  algorithm  in  symmetric  mode,  it  is  neces- 
sary for  an  equilibrium  with  W  =  O  to  be  asymptotically  stable  that  A  = 
[±Un,  .  .  . ,  ±u,,]'  with  {ii,.  .  .,ip}  =  {1,  .  .  .  ,p},  and  a  <  0.  If  a  >  0,  all  equilib- 
ria with  W  —  O  are  unstable. 
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As  real-time  implementations  of  Foldiak's  original  algorithm  use  P{W)  —  W'' 
for  some  ^'  >  1  and  therefore  have  cr  >  0,  we  conclude  that  for  these  algorithms, 
none  of  the  desired  limit  points  is  asymptotically  stable.  Similarly,  as  a  =  0 
if  P{W)  =  O,  we  infer  that  no  feedback-free  local  PCA  algorithm  gives  rise  to 
asymptotically  stable  desired  equililibria  if  run  in  symmetric  mode. 

Therefore,  hierarchical  decorrelation  should  always  be  preferred  over  more 
competitive,  symmetric  decorrelation  mechanisms,  for  reasons  of  superior  per- 
formance of  the  algorithms  —  in  symmetric  mode,  the  desired  limit  points  are 
not  attractive  enough.  In  fact,  it  is  not  too  hard  to  give  a  biologically  plau- 
sible interpretation  of  these  findings.  In  our  cases,  the  local  updating  of  the 
.4  part  of  the  network  weights  forces  the  rows  of  .4  to  be  eigenvectors  of  the 
input  covariance  matrix  S;  in  addition,  if  the  network  wants  to  extract  as  much 
input  information  as  possible,  the  rows  must  not  be  collinear  and  hence  they 
have  to  be  mutually  perpendicular.  Thus,  the  local  structure  forces  the  units  to 
follow  a  very  strict  hierarchy  (of  output  variance).  On  the  other  hand,  if  each 
unit  tries  to  maximize  its  output  variance  (subject  to  identical  constraints)  and 
all  units  are  allowed  to  compete  equally,  then  it  will  take  much  longer  for  a 
hierarchical  structure  to  evolve  than  if  this  hierarchy  is  explicitly  forced  (or  at 
least  strongly  supported)  by  the  network  interconnection  topology.  Of  course, 
allowing  for  competition  in  order  to  obtain  more  "balanced"  representations  is 
quite  attractive;  however,  this  balance  should  be  structurally  stable. 

For  the  subspace  algorithm  (1),  $  is  the  identity  mapping  and  Q  =  0  such 
that  W  =  O.  It  is  shown  in  Baldi  &  Hornik  (1991)  that  all  equilibria  with  full 
rank  R  are  of  the  form  A  =  C[?i,i,  .  . . ,  u,,,]'  where  the  ij  are  mutually  distinct 
and  C  is  an  arbitrary,  orthogonal  p  x  p  matrix.  Therefore,  these  equilibria  are 
not  isolated  and  thus  cannot  be  asymptotically  stable.  This  deficiency  of  the 
subspace  algorithm  was  already  noticed  implicitly  in  Williams  (1985).  It  can  be 
shown  that  the  component  of  small  perturbations  about  an  equilibrium  which 
are  perpendicular  to  span (.4')  die  out  asymptotically  if  and  only  if  {?i,  .  .  . ,  ?p}  = 
{1, .  . .  ,p};  a  proof  of  this  fact,  together  with  a  discussion  of  the  evolution  of  the 
components  along  span(.4'),  see  Krogh  Sz  Hertz  (1990). 

For  Sanger's  GHA  (2),  $  =  lower  and  Q  =  0  such  that  W  =  O.  It  is 
easily  seen  by  induction  (cf.  e.g.  the  proof  ot  theorem  3)  that  all  equilibria 
with  full  rank  R  are  of  the  form  A  =  [±Ui, ,  •  •• ,  ±"ip]'  with  mutually  distinct 
I  <  ii,  ■  ■  ■  ,ip  <  d.  The  following  result  is  implicitly  contained  in  both  Oja  h 
Karhunen  (1985)  and  Sanger  (1989). 
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Theorem  6.    The  asymptotically  stable  equilibria  of  Sanger's  GHA  (2)  are 

A  =  [±ui,...,±Up]'; 
all  other  equilibria  are  unstable. 

5      Remarks  and  Problems 

In  Oja  L  Karhunen  (1985),  it  is  shown  (lemma  5,  p.  80)  that  if  Oja's  one-unit 
algorithm  is  used  with  uniformly  bounded  inputs  and  the  learning  rates  are 
sufficiently  small  (but  do  not  necessarily  tend  to  0),  then  the  weight  updates 
automatically  remain  inside  some  bounded  subset  of  the  weight  space.  It  is  def- 
initely worthwhile  investigating  whether  or  not  multioutput  generalizations  also 
possess  this  very  strong  stability  property,  in  particular  if  local  algorithms  with 
asymmetric  decorrelation  do  so.  If  this  were  the  case,  explicit  truncation  would 
become  obsolete.  However,  this  question  appears  to  be  extremely  challenging. 

If  the  ODE  method  is  to  be  used  for  producing  explicit  convergence  results 
for  on-line  PCA  learning  algorithms,  T>(0*),  the  domain  of  attraction  of  the  set 
of  asymptotically  stable  equilibria,  has  to  be  identified  (or  at  least,  one  should 
succeed  in  exhibiting  a  suitably  "rich"  subset  of  X>(0* )  which  the  updates  could 
be  confined  to).  However,  as  already  indicated  in  section  3,  the  asymptotic 
ODEs  of  multioutput  algorithms  are  quite  complicated,  thereby  making  such 
an  identification  very  hard. 

In  fact,  the  only  case  where  a  complete  global  analysis  of  the  ODE  is  available 
appears  to  be  Oja's  one-unit  algorithm  (Oja  k  Karhunen,  1985).  For  the  sub- 
space  algorithm,  the  analysis  in  both  Williams  (1985)  and  Krogh  &  Hertz  (1990) 
is  strictly  local,  and  it  actually  seems  to  still  be  unknown  whether  an  equilibrium 
of  the  form  A  =  [±ui,  .  . . ,  :tup]'  is  stable  or  unstable  (remember  that  it  cannot 
be  asymptotically  stable).  Sanger  (1989,  p.  463)  claims  that  for  the  GHA,  "the 
domain  of  attraction  (of  Q'  =  {A  :  A  =  [±ui,  .  .  . ,  ±Up]'])  includes  all  matrices 
with  bounded  weights",  which  is  obviously  wrong  due  to  the  existence  of  un- 
stable equilibria  of  the  form  A  =  [±Uij ,  .  .  . ,  ±u,  ]'  with  (i'l ,  .  .  . ,  Zp)  ^  ( 1, .  .  . ,  p). 
Clearly,  for  local  PCA  algorithms  the  global  asymptotic  analysis  is  even  harder. 
However,  exhaustive  computer  simulations  confirm  rapid  convergence  to  0'  in 
asymmetric  mode  even  when  starting  with  very  large  weights  or  initial  configu- 
rations very  close  to  unstable  equilibria. 

Finally,  let  us  mention  that  the  asymptotic  analysis  for  local  PCA  algo- 
rithms which  update  A  using  hebbian  learning  with  explicit  rowwise  euclidean 
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normalization  rather  than  apply  Oja's  one-unit  algorithm  to  each  of  the  rows 
of  A  is  more  or  less  identical  to  the  analysis  presented  here,  cf.  e.g.  the  results 
in  Oja  k  Karhunen  (1985). 
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Appendix:  Mathematical  Proofs 

Proof  of  theorem  1.  Let  0,  be  the  set  of  all  0  =  [vec(/i)',  vec(VV')']',  where 
,4  is  an  arbitrary  p  x  d  matrix  and  W  is  a  subdiagonal  p  x  p  matrix.  We 
proceed  by  applying  theorem  2.5.2  of  Kushner  &  Clark  (1978)  with  the  obvious 
modification  that  as  by  assumption  the  algorithm  keeps  9n  in  0«,  all  conditions 
have  to  be  verified  only  for  ^  in  65.  (Alternatively,  one  could,  at  the  expense 
of  more  complicated  notations,  work  with  a  nonredundant  parametrization  0  — 
{A,W).) 

In  vectorized  form,  the  algorithm  is 


?„    =   9r^-l   +  7„  h{ZnJn- 


where  Zn  =  (Jn,yn), 


h{z,6)  = 


vec{yx'  -^(yy')A) 
vec(Q(j/y')) 


and  the  y^  are  generated  according  to  ( 17).  We  already  know  that  for  all  6  G  Q,, 


lim  Eh{zn(9),e)  = 


veciQ{W)AE-<^(R)A) 
vec{QiR)) 


:=  h{e), 


(23) 


where  of  course  R  =  R{A,  W). 

Applied  to  our  case,  theorem  2.5.2  of  Kushner  L  Clark  (1978)  states  that  if 
conditions  [A2.2.3],  [A2.4.5],  [A2.5.2]  and  [A2.5.3]  of  Kushner  k  Clark  (1978) 
are  satisfied  and  {On]  remains  bounded  with  probability  one,  then  {^"(O}  is 
bounded  and  equicontinuous  on  bounded  intervals  with  probability  one,  and  the 
limits  of  convergent  subsequences  satisfy  the  ODE  9  —  h{9).  As  we  assumed  the 
boundedness  condition,  the  proof  of  theorem  1  is  completed  upon  verification 
of  the  abovementioned  conditions. 

[A  2.2.3]  is  trivially  satisfied.  We  start  with  [A  2.5.2]. 

A  2.5.2.   There  is  a  continuous  function  h{)  such  that  for  some  T  >  0.  for  each 
e  >  0  and  each  9  E  Qs, 


iim    P<  sup  max 

"-<»  l;>r;      t<T 


where 


^       '„{h{zd9),9))-h{9)) 


m{t)  =  max{n  :  („  <  t). 


>  ( 


0, 
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If  h{-)  is  defined  by  (23),  it  is  clearly  continuous  for  6  G  Q,.    Fix  0  ^  Q,  and 
T  >  0,  and  let  hn{0)  =  h{zn{d),d).  Observe  that  for  t  <  T, 

i  =  m(jT) 

for  all  j.    For  each  e  >  0,  we  can  choose  n  sufficiently  large  such  that  for  all 
i  >  rn{jT)  and  j  >  n,  |E  h,{0)  -  h{e)\  <  f/(2r).  Hence  for  n  sufficiently  large, 


sup  max 

j>n     t<T 


<     sup  max 

j>n     t<T 


<     sup  max 

J>n     t<T 


m(jT  +  t)-l 

x-m(jT) 

m{jT  +  t)-l 

J2     yiih,{e)-Eh,{e)) 

i  =  rn{jT) 

m{jT  +  t)-l 

J2       l.(Eh,{9)-h{9)) 

i  =  mUT) 

J2       'n  {hdd)  -  E  h,(9))     +   e/2. 

i  =  m{jT) 


+  sup  max 

j>n     t<T 


Note  that  almost  sure  convergence  of  Yl'i  =  ].  7'  e^il^)  ~  ^  ^i{^))  's  equivalent  to 
the  condition  that  for  each  f  >  0, 


lim    P<  sup 

"^°o  [rn>n 

which  in  turn  implies  that 


J2-^Ah.{9)-Eh,{e)) 


<f/2     =1, 


lim    P^  sup  max 

j>n     t<T 


J^     j,{h,(9)-Eh,{e)) 

t-m{jT) 


<e/2S  =  l. 


Hence  [A  2.5.2]  is  established  if  we  show  that  Yl'i-i ')''  (^i(^)~E  h,{9))  converges 
with  probability  one,  which  can  be  done  by  showing  that  {jn(hn(d)  —  E  h„{6))} 
is  a  mixingale  of  size  —1/2  (or  larger)  with  square  summable  magnitude  indices 
and  using  the  mixingale  convergence  theorem  (McLeish,  1975,  corollary  1.8). 

As  {xn}  is  bounded  by  assumption,  {yn(^)}  is  clearly  bounded  as  well.  By 
lemma  1  in  Andrews  (1989),  vec(y„(^)xn)  and  vec(yn(0)y„(0)')  are  near  epoch 
dependent  (NED,  cf.  Gallant  h  White,  1988)  on  {x„}  of  arbitrarily  large  size. 
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It  follows  by  lemma  3.14  of  Gallant  k  White  (1988)  that  {h„{0)  -  E/i„(^)} 
is  a  bounded  mixingale  of  arbitrarily  large  size  and  with  bounded  magnitude 
indices  {(:„}.  Hence,  {yn(h,i{d)  -  E  hn{9))}  is  also  a  bounded  mixingale  of  ar- 
bitrarily large  size  and  magnitude  indices  {fnCn}  which  is  square  summable  by 
assumption  [A  2]  and  boundedness  of  {cn}.  Therefore,  the  mixingale  conver- 
gence theorem  applies,  and  [A  2.5.2]  is  established. 

If  a  sequence  {y„}  is  generated  by  !/„  —  Pn-iyn-i  +  Cn,  we  find  upon  resub- 
stitution  that  for  all  k, 

k-l 


Vn    =   Ln-\,kyn-k  +  ^  -^n- 1  ,iCn-i, 


1  =  0 


where  Ln-i,o  =  I  and  for  ;  >  1,  Ln-i^k  =  Pn-i  ■  ■  ■  Pn-k-  In  particular,  if  all  Pn 
are  subdiagonal  p  y.  p  matrices,  L^  t  —  O  for  all  k  >  p,  and 


p-i 

t  =  0 


l.iCn- 


(24) 


Now  let  P„  =  PiWr,),  c„  =  Q(Wn-i)An-ix^  with  ^„  =  [vec(,4„)',vec(PF„)']' 
a  bounded  sequence  in  0,.  Using  the  above  formula,  it  is  readily  seen  that  for 
all  f  >  0  we  can  find  S  >  0  such  that  if  {9n}  is  another  sequence  in  0^  satis- 
fying maXm-it<n<m+(  |^n -^r>  I  <  6  and  yn  is  generated  by  y„  =  P{Wn-i)yn-i  + 
Q(\Vn-].)An-iXn,  then  max,n<r,<m+/  \yn—yn\  <  f,  which  in  turn  implies  [A  2.5.3]. 

Let  ®  denote  the  Kronecker  product  of  two  matrices  (see  e.g.  Magnus  & 
Neudecker,  1988)  such  that  vec{LMN)  =  [N'  ®  L)vec(M)  for  all  matrices  L, 
M,  N  of  compatible  dimensions.  Then  vec($(j/i/),4)  =  (A'  0  /)  vec(<I>(yy')), 
and  thus 


\h(z,  e)\  <  \A'  ®  /|   |vec(^(yyO)|  +  |vec(yx^)|  +  |vec(Q(yyO)| 


93( 


94iz) 


Clearly,  yi()  is  bounded  on  bounded  subsets  of  0j.  For  M  >  0,  let  S\f  be  the 
event  that  for  some  /  G  {3,4}  and  some  n,  y;(-n)  >  M ■  As  by  assumption  {9,^} 
and  {jn}  are  bounded  with  probability  one,  (24)  shows  that  {yn}  and  hence 
also  {gii^n)}  remain  bounded  with  probability  one;  hence  limA/_co  P  (^A/)  =  0. 
On  the  complement  oi  S\f, 


sup  max 


m(;T  +  ()-l 


^        7.3/(m) 


t  =  m{jT} 


m(jT  +  t)-l 

<  M    sup  max       Y_^        7i    £    MA. 

t  =  m(jT) 
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Let  e  >  0  be  arbitrary.  As  soon  as  A  <  t/ M , 


P  \  sup  max 


m{jT  +  t)-]. 

^  7.3/(-.) 

i  =  m{]T) 


>f  >  <p(<^u: 


Now  let  M  —*  oci  to  conclude  that  for  all  e  >  0  and  /  G  {3,4}, 


lim    P<  sup  max 


m(jr+()-l 
.■=m(;r) 


>  t 


establishing  [A  2.4.5]  and  thereby  completing  the  proof  of  theorem  1. 

Proof  of  theorem  2.    Letting  M  =  Q{W)-^^{R),  the  first  equation  of  (20) 
becomes  .4S  =  MA.  Hence, 


M  Aui  =  A'Eu,  =  A,  Aui. 


?  =  1 d, 


from  which  we  conclude  that  Au,  is  either  zero  or  an  eigenvector  of  M  with 
eigenvalue  A,.  As  by  [A3]  all  A,  are  distinct,  the  nonzero  Au,  are  linearly  inde- 
pendent by  a  well-known  result  from  linear  algebra;  on  the  other  hand,  the  num- 
ber of  linearly  independent  (and  thus  nonzero)  Aui  equals  rank(A  [ui,  .  .  . ,  u^])  = 
rank(.4)  =  r.  Now  let  Uo  =  ["i,,  •  ■  • ,  "iJ-D  where  £)  is  a  suitable  diagonal  ma- 
trix with  entries  ±1  and  ?i,  .  .  . ,  zV  is  some  permutation  of  the  indices  in  Jq  = 
{i  :  Auj  /  0);  similarly,  let  j'l, .  .  .  ,jd-r  be  the  indices  in  I±  =  {i  :  Aui  =  0}  ar- 
ranged in  ascending  order  and  [^j.  =  [uj^, . .  ■,Uj^_^].  Then  clearly  C  :=  AUoD 
has  full  rank  r,  AUi_  =  O,  such  that  finally,  as  u,  and  Uj  are  perpendicular 
for  i  ^  J  and  thus  /  =  UoU{>  +  U^Ul,  A  =  A{UoU(>  +  U±Ul)  =  AUqUI)  = 
C[ui,, . .  .,u,J'. 

Proof  of  theorem  3.  If  \V  =  O,  the  equations  for  a  critical  point  give 

AE  =  diag{R)A,         off"diag(7?)  =  0,  R  =  .4E.4'. 

If  R  is  full  rank,  so  is  ,4.  It  follows  that  the  rows  of  .4  are  eigenvectors  of  S  with 
corresponding  eigenvalue  matrix  R,  and  that 

.4E  =  diag(.4E.4')A  =  .4S.4'.4. 

Multiplying  by  A'  from  the  right  we  finally  conclude  that  .4,4'  =  /,  whence  the 
first  assertion  of  the  theorem. 
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The  proof  of  the  second  assertion  is  more  involved.  Let  e,  denote  the  j-th 
Cartesian  unit  vector,  and  let  ai  =  ej.4  be  the  i-th  row  of  .4  so  that  A'  — 
[oi, . .  .,ap].  The  equations  for  a  critical  point  are  ofFdiag(/?)  =:  O  and,  taking 
transposes  for  notational  convenience. 


EA'Q(Wy  =  A'diagiR). 


(25) 


Observe  that  R  is  a.  nonnegative  definite  diagonal  matrix  which  is  full  rank  by 
assumption.  Hence,  all  its  diagonal  entries  p,  =  e'iRei  are  strictly  positive.  If 
W  is  subdiagonal,  we  can  write  W  as 


"0      W2,l       i^3,l 
0  U)3  2 


w  = 


0      tjp,p_i 
0 

Trivially  W'e\  —  0,  and  thus  Q[W)' P{W)'"^e\  equals  ex  for  m  =;  0  and  zero  for 
m  >  0.  Hence, 

pi  =  e[Rex  =  e[AT.A'ex  =  a[^ax 


and 


pioi  =  .4'diag(/?)ei  =  i:A'Q{W)'ex  =  EA'd  =  Eai 


As  pi  >  0,  we  conclude  that  ai  is  a  unit  length  eigenvector  of  S  with  corre- 
sponding eigenvalue  pi. 

We  now  proceed  by  induction.  Suppose  we  have  already  shown  that  for 
all  i  <  /,  Wci  =  0  and  that  the  Oi  are  mutually  perpendicular  unit  length 
eigenvectors  of  S  with  corresponding  eigenvalues  p, .  Then,  if  i  <  /,  we  have 
Q{Wye,  =  e.  and  P(W)"^e,  =  0  for  m  >  0;  if  both  ij  <  I, 


/.v^.  _ 


Pi'     «  =  ; 
0,       i  f  J. 


Observing  that 


we  obtain 


Wei  =  ^l,iei  + h  w,,/_ie/_i, 


W'-e,  =  W'{uii_iei  + htj;,,_ie;_i)  =  uniW'ei  + \-  ^ij-iWei^i 

and  thus  also  W"^et  -0,m>2.  In  particular,  Q{Wyei  -  [I  +  rlF')e,. 
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Now  let  i  <  I.  As  R  is  diagonal, 

0  =  e;/?e,  =elA^A'Q{Wye,  =  a[EA'(I  +  TVV')e!  =  ajSa,  +  rp,a;,,,. 

On  the  other  hand,  multiplication  of  equation  (25)  by  a',  from  the  left  and  by 
6/  from  the  right  gives 

a'.llai  +  Tp^uii  =  pi  a[ai. 

Combining  both  equations  and  using  that  pi  is  positive  by  assumption,  we  con- 
clude that  for  all  i  <  /,  a',ai  =  0.    Thus,  rpju;/ ,  =  — aJSa/  =  —rpiaUit  —  0, 
whence  w;  j  =  0  because  both  r  and  p,  are  nonzero.    Summing  up,  Wei  —  0 
and  a;  is  perpendicular  to  all  a,  with  i  <  I. 
This  finally  gives 

pi  —  e'lRet  —  a'iT,ai 

and 

p,a,  =  A'd\8ig(R)ei  =  S^'Q(iy)'e,  =  S^'e,  =  Sa,, 

hence  at  is  a  unit  length  eigenvector  of  S  with  corresponding  eigenvalue  pi, 
completing  the  induction  step. 

Example  1.  As  an  example  for  the  possible  existence  of  equilibria  with  full 
rank  R  and  subdiagonal,  but  nonzero  [V  in  the  case  where  r  =  0,  consider 
P(W)  =  W,  Q{W)  =  I,  and  p  =  2.  (The  example  may  trivially  be  extended  to 
larger  values  of  p.)  Let  us  write  w  =  u^y  for  the  only  nonzero  entry  of  W ,  let  u  be 
a  unit  length  eigenvector  of  S  with  associated  eigenvalue  A,  and  let  A  —  [u,  0]'.  It 
is  easily  seen  that  for  all  w,  we  have  AT,  —  RA,  where  R  —  R{A,  W)  is  diagonal 
with  entries  A  and  u'-A.  Hence,  whenever  u;  ^i^  0,  {A,  W)  is  an  equilibrium 
with  subdiagonal,  nonzero  W  and  full  rank  R.  This  example  also  shows  that  if 
r  =  0,  there  may  be  equilibria  with  subdiagonal,  nonzero  W ,  full  rank  /?,  and 
rank  deficient  .4. 

Example  2.  We  now  show  that  equilibria  with  W  —  O  can  actually  lie  on 
a  whole  curve  resp.  surface  of  equilibria  with  symmetric  W.  Let  P{W)  =  W, 
Q{W)  =  I,  and  again  for  sake  of  notational  simplicity  let  p  =  2  (the  example 
also  works  for  p  >  2).  Consider  an  equilibrium  with  W  =  O  and  .4  =  [u,  i']', 
where  u  and  v  are  mutually  perpendicular  eigenvectors  of  E  with  corresponding 
eigenvalues  A  and  p..  Whenever  u)~  <  min(A//i, /x/A),  we  can  find  a  =  a(<*')i 
(3  —  0{ijlj)  satisfying 

2  1  2  ^  ,32  1  2 

a=l— u;— ,  /y=l— u; 


A' 


/^ 


22 


such  that 


a-X  +  ^S-^i     =     (l-u;-)(A  +  /x), 
a-A-/?V     =     (l+^')(A-/i). 


Now  let 


0      u; 
u     0 


72 


1 


1 


/IP)  =  [a(w)u,/?(u;)t;]',  H^(u;)  = 

Then  clearly  A{uj)i:  =  diag(A, /i).4(w),  and,  as  W{uj)  =  V'D(w)\/',  where 

D{u))  =  diag(u;,  —ui), 

we  have 
R{A(u;)^V{u;)) 

oo 

a(u;)'A 


1  =  0 

oo 

1  =  0 

1    "^ 


iSiuffi 


vD{ujyv' 


1 


1  =  0 


V 


Diuyv 


A 


{a(u>fX  +  f3{u)-^)/{l  -  Lo'-)     (a(c^)2A  -  /?(cj)V)/(l  +  ^ 
X  +  fi     X  —  1^1 


v 


X  —  pL     A  +  /i 


v 


/ij 


Hence,  whenever  u;'  <  min(A//i,  /x/A),  A{u)  and  Vy(u;)  satisfy  (22).  As  'u>  varies, 
we  obtain  a  curve  of  distinct  equilibria,  which  for  u;  =  0  contains  the  desired 
equilibrium  A,  W . 

Stability  analysis  of  the  equilibria.  For  the  remaining  proofs  of  theorems  4 
to  6,  it  will  be  convenient  to  start  the  analysis  of  the  local  stability  properties 
of  the  equilibria  at  a  general  level  which  in  fact  pertains  to  all  PCA  algorithms 
considered  in  this  paper.  We  shall  continue  to  use  the  notations  introduced  in 
the  proof  of  theorem  2;  in  addition  to  that,  let  Co  =  AUq  and  let  Aq  and  Aj. 
denote  the  diagonal  matrices  with  entries  Ajj , . . . ,  Aj^  respectively  Aj, ,  .  .  . ,  Xj^_^ 
such  that  HUq  =  UqA-o  and  HUx  =  U±A±. 
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To  investigate  the  local  stability  properties  of  an  equilibrium,  we  have  to 
consider  the  evolution  of  small  perturbations  about  the  equilibrium.  Let  E  and 
H  be  the  perturbations  of  .4  and  W,  respectively.  After  linearization  we  obtain 

E     =     Q(W)E'E:  +  dQ(H\W)A'L-^{R)E -^dR(E,H;A,W))A  (26) 
H     =     Q{dR{E,H\A,W)),  (27) 

with  the  additional  constraint  H  G  VV;  here,  dQ{H;  W)  etc.  denote  Frechet 
differentials,  i.e. 

Q{\V  +  eH)  =  Q{\V)  +  €  dQ{H;  W)  +  0(r)         as  e  —  0. 

Now  notice  that  dR  depends  on  E  only  via  EILA'  =  EY^UqCq  =  EUoh-oCo 
and  its  transpose;  hence,  dR.(E,  H\  A,  W)  is  of  the  form  T(EUo,  H;  A,  W).  Let 
Eo  =  EUo  and  E^  =  EU±  .  As  A'^Ux  =  AU^A^  =  O,  (26)  and  (27)  are 
equivalent  to 

E±     =     Q(\V)ExAi-^(R)Ex,  (28) 
£"0     =     Q{W)EoAo  +  dQ{H-W)CoAo 

-^{R)Eo-<^{T{Eo,H;A,W))Co,  (29) 

H     =     Q(T{Eo,H-A,\V)).  (30) 

Equations  (28)  resp.  (29)  describe  the  evolution  of  perturbations  of  .4  perpen- 
dicular to  resp.  along  span(/l'),  expressed  in  terms  of  the  bases  given  by  the 
columns  of  [(l  resp.  Uq-  Obviously,  equation  (28)  does  not  depend  upon  £"0  and 
//,  whereas  (29)  and  (30)  are  independent  from  E^. 
Writing  v±  —  vec(£'j_),  (28)  becomes 

v±  =  (A±®Q{W)-I®^R))vx 

and  we  immediately  deduce  the  following  result. 

Lemma.  Let  A,  W  be  an  equilibrium  of  the  asymptotic  ODE.  Then  small 
perturbations  of  A  perpendicular  to  span(.4')  die  out  asymptotically  if  and  only 
if  all  eigenvalues  of  the  matrix 

K{A,  W)  =  Ax®Q{W)-I®  diag(/2) 

have  negative  real  parts. 

Hence,  an  equilibrium  A,  W  can  only  be  asymptotically  stable  if  the  above 
condition  is  satisfied;  in  particular,  R  has  to  be  full  rank.  If  one  of  the  eigenvalues 
of  K{A,  W)  has  positive  real  part,  the  equilibrium  is  unstable. 
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In  all  subsequent  proofs,  equilibria  with  W  =  O  are  of  interest.    In  these 
cases,  as  R{CoU(i,0)  =  CoAoa,  dQ(H,0)  =  tH  and 

dRiE,  H-CULO)  =  {a  +  t)HR  +  EqAoC^  +  CqAqE^  +  ((t  +  t)RH', 

(29)  and  (30)  become 

Eo     =     EoAo  +  r  HCqAq  -  HR)Eo 

-^{{a  +  t)HR+  EoAoa  +  CqAq^o  +  (<t  +  r)RH')Co     (31 ) 
H     =     Q{{a  +  T)HR+EoAoC^+CoAoEl>+ia  +  r)RH')  (32) 

with  H  G  VV. 

Proof  of  theorems  4/5.  If  R  is  not  full  rank,  the  above  lemma  yields  that 
the  equilibrium  is  unstable.  Hence,  suppose  that  R  is  full  rank.  Using  the  first 
assertion  of  theorem  3,  we  infer  that  we  may  take  Uo  =  A  such  that  Ao  =  R 
and  Co  =  /.  As  the  eigenvalues  of  K{Uo,0)  =  Aj_  ®  /  —  /  0  Ao  are  \j  —  A, 
with  {i,j)  G  Jo  X  2" J.,  the  lemma  implies  that  the  equilibrium  can  only  be 
asymptotically  stable  if  Jo  =  {I,  .  . .  ,p},  and  is  unstable  otherwise. 

As  R  =  Ao  is  diagonal  and  H  has  zero  diagonal,  we  find  that  diag(///?)  = 
di&g{RH')  =  O  such  that  (31)  and  (32)  simplify  to 

^0     =     EoR  +  tHR-  REo-2diag{EQ)R.  (33) 

H     =     n{ia  +  T)HR-\- EoR+ RE^  +  (<r  +  T)RH'),  (34) 

where  of  course  Q  =  — offdiag  in  symmetric  mode  and  Q  =  — subdiag  in  asym- 
metric mode.  The  proof  can  now  easily  be  completed  by  noticing  that  (33)  and 
(34)  can  be  decomposed  into  independent,  one-  resp.  three-dimensional  sub- 
problems.  In  fact,  let  ^,j  and  j],j  be  the  (?,  j)-th  component  of  Eq  resp.  H .  and 
denote  the  z-th  diagonal  entry  of  R  by  p,.  For  i  =  j,  we  have  r;,,  =  0  and 

^11  =  -2/9,6i> 

which  tends  to  zero  exponentially  fast. 

Now  let  i  >  j.    In  symmetric  mode,  we  have  rjij  =  r/^,  and  we  obtain  the 
systems 

where 

6j  1  fP;  -  Pi  rpj 

'/.;      .  K'ji(^>^)=         -Pj         -(<^  +  -r)(p.  +P;)         -Pt 

.^j.  J  L  rpi  p,  -  pj 
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pj  -  pi 

^pj 

-Pi 

-((^  +  ^)(P.  +P;)          -Pi 

Pi  -  Pi 

It  is  straightforwardly  computed  that 

det(M/_,(<T,  r))  =  a{p,  -  Pjf(pi  +  Pj). 

As  the  degree  of  the  characteristic  polynomial  4>'j(\)  =  det{M'j{cr,  t)  —  X  I)  is 
3,  its  roots,  being  the  eigenvalues  of  M'Acr,  r),  are  either  all  real,  or  one  is  real 
and  two  are  complex  conjugate.  Hence,  if,  as  has  to  be  the  case  for  asymptotic 
stability,  all  eigenvalues  have  negative  real  parts,  the  determinant  has  to  be 
negative,  which  is  only  possible  if  a  <  0.  On  the  other  hand,  if  (t  >  0,  then  the 
determinant  is  positive,  hence  at  least  one  of  the  eigenvalues  has  positive  real 
part  and  the  equilibrium  is  unstable,  thereby  completing  the  proof  of  theorem  5. 

In  asymmetric  mode,  we  have  r/j,  =  0  (remember  that  i  >  j)  and  we  obtain 
the  systems 

.      ^i;  =  M°((T,r)i',_,-, 

where  now 


M,^((7,r) 


Hence,  one  eigenvalue  of  M^Aa^r)  is  pi  —  pj ,  from  which  we  conclude  that 
the  equilibrium  can  only  be  asymptotically  stable  if  the  p,  are  arranged  in 
descending  magnitude,  which  together  with  the  condition  Iq  —  {l,...,p}  we 
already  established  implies  that  p\  —  \\,  .  .  . ,  pp  =  \p  and  .4  =  [±ui ,  .  .  . ,  ±Up]', 
and  that  otherwise  it  is  unstable.  In  the  case  where  p\  =  Aj, .  .  .  ,pp  =  Ap,  the 
remaining  eigenvalues  are  the  roots  of  the  equations 

A'  +  (A,  -  A;  +{a  +  t)X^  )  a  +  (^  +  r)A^(A,  -  A^- )  +  rA;  =  0. 

■^ '         ^ V ' 

=  a,;(<T,  r)  =Xjf3,j(a,T) 

It  is  easily  seen  that  both  roots  of  the  quadratic  polynomial  A-  +  qA  +  /?  have 
negative  real  parts  if  and  only  if  both  a  >  0  and  /?  >  0,  and  that  at  least  one 
root  has  positive  real  part  if  a  <  0  or  /?  <  0.  Applied  to  our  case,  we  thus 
have  asymptotic  stability  if  and  only  if  for  all  (i,j)  £  J  we  have  aij(a,  r)  >  0 
and  l3ij{cr,  t)  >  0,  and  conversely,  if  for  some  {i,j)  ^  J  we  have  Q,j(cr,  r)  <  0 
or  f3ij{(T,T)  <  0,  the  equilibrium  is  unstable.  In  particular,  this  is  the  case  if 
<T  -(-  r  <  0,  because  then  Pij((T,  r)  <  Xi  —  Xj  <  0. 
Finally,  it  is  clear  that  the  inequalities 

A,  +  (cr  +  r  -  l)Xj  >  0,  (a  +  r)Ai  -  <tA_,  >  0 
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can  be  valid  for  arbitrary  Ai  >  ■  •  •  >  Ap  >  0  and  ii,j)  G  J  if  and  only  if 
a  +  T  >  I  and  o"  <  0,  i.e.  if  and  only  if  (cr,  r)  G  5,  and  the  proof  of  theorem  4  is 
complete. 

Proof  of  the  corollary.  If  W  is  subdiagonal,  QiW)  is  lower  diagonal  with 
diag(Q(l^V'))  =  /  and  hence  full  rank.  By  our  lemma,  all  equilibria  with  rank 
deficient  R  are  unstable.  Now  observe  that  {a,  r)  G  «S  implies  in  particular  that 
r  >  1,  hence  by  theorem  3  all  equilibria  with  full  rank  R  are  of  the  form  .4  — 
[±u,,, .  .  . ,  ±u,p]'  and  W  =  O.  By  theorem  4,  these  equilibria  are  asymptotically 
stable  if  and  only  if  zj  =  l,...,ip  =  p,  and  unstable  otherwise,  whence  the 
corollary. 

Proof  of  theorem  6.  For  Sanger's  GHA,  W  =  0  and  hence  Q{W)  =  I.  Again 
using  the  lemma,  all  equilibria  with  rank(i?)  <  p  are  unstable.  The  equilibria 
with  full  rank  R  can  be  shown  to  be  .4  =  [iui, ,  ••  • ,  i"«p]'  such  that  again  we 
may  take  Uo  =  A,  Aq  =  R,  and  Co  =  /,  and  by  applying  the  lemma  once  more 
it  follows  that  the  equilibrium  is  unstable  if  Jo  ^  {1,  •  .  .,p}  Trivially  H  =  O, 
and  (31)  becomes 

Eo  =  EoR-  REo  -  loweriEoR  +  RE(,) 

which  now  decomposes  into  one-  and  two-dimensional  subsystems.  For  i  =  j, 
we  have 

which  tends  to  zero  exponentially  fast,  and  for  i  >  j  we  obtain 


6; 


-pi      -pi 
Pi  -  pj  J 


and  we  immediately  deduce  that  the  equilibrium  is  asymptotically  stable  if  and 
only  if  ?i  =  I, . . .  ,ip  =  p,  and  otherwise  unstable. 
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